115 research outputs found

    Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks

    Kernel spectral clustering corresponds to a weighted kernel principal component analysis problem in a constrained optimization framework. The primal formulation leads to an eigen-decomposition of a centered Laplacian matrix at the dual level. The dual formulation allows a model to be built on a representative subgraph of the large scale network in the training phase, and the model parameters are estimated in the validation stage. The KSC model has a powerful out-of-sample extension property which allows cluster affiliation for the unseen nodes of the big data network. In this paper we exploit the structure of the projections in the eigenspace during the validation stage to automatically determine a set of increasing distance thresholds. We use these distance thresholds in the test phase to obtain multiple levels of hierarchy for the large scale network. The hierarchical structure in the network is determined in a bottom-up fashion. We empirically show that real-world networks have multilevel hierarchical organization which cannot be detected efficiently by several state-of-the-art large scale hierarchical community detection techniques like the Louvain, OSLOM and Infomap methods. We show a major advantage of our proposed approach, i.e. the ability to locate good quality clusters at both the coarser and finer levels of hierarchy, using internal cluster quality metrics on 7 real-life networks. Comment: PLOS ONE, Vol 9, Issue 6, June 201
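The bottom-up use of increasing distance thresholds can be illustrated with a small sketch. This is not the MH-KSC procedure itself: the thresholds below are hand-picked (the paper derives them automatically from the validation projections), the merge rule (cosine distance between cluster centroids, applied transitively) is an assumed stand-in, and the names `merge_level` and `build_hierarchy` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def merge_level(clusters, projections, threshold):
    """Merge clusters whose centroids are closer than `threshold` (transitively)."""
    cents = [centroid([projections[i] for i in c]) for c in clusters]
    parent = list(range(len(clusters)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            if cosine_distance(cents[a], cents[b]) < threshold:
                parent[find(b)] = find(a)

    groups = {}
    for idx, members in enumerate(clusters):
        groups.setdefault(find(idx), []).extend(members)
    return [sorted(g) for g in groups.values()]


def build_hierarchy(projections, thresholds):
    """Start from singletons and apply each threshold in increasing order."""
    clusters = [[i] for i in range(len(projections))]
    levels = []
    for t in sorted(thresholds):
        clusters = merge_level(clusters, projections, t)
        levels.append(clusters)
    return levels


# Toy 2-D "eigen-projections" (made-up values, purely illustrative).
projections = [(1.0, 0.0), (1.0, 0.05), (1.0, 0.5), (0.0, 1.0), (0.05, 1.0)]
levels = build_hierarchy(projections, thresholds=[0.01, 0.2, 0.99])
for depth, clusters in enumerate(levels, start=1):
    print(depth, clusters)
```

Each pass reuses the clusters of the previous level, so finer levels are always nested inside coarser ones, which is the property a bottom-up hierarchy needs.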

    Detection of statistically significant network changes in complex biological networks

    Table S1. Description of data: GHD and MRA Results for all the 457 considered transcription factors on the TCGA and Rembrandt datasets. (XLSX 62.7 kb)

    An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity

    Disease processes are usually driven by several genes interacting in molecular modules or pathways leading to the disease. The identification of such modules in gene or protein networks is the core of computational methods in biomedical research. In this context, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method based on ideas of merging and splitting the hierarchical tree obtained from any community detection technique for constrained DMI in biological networks. The only constraint was that the size of each community lies in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics like modularity, conductance and connectivity to determine the quality of a graph partition at a given level of hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false detection rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 DM (rank 10) respectively. From additional analysis, our proposed approach detected a total of 44 DM in the networks in comparison to 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with the winner.
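Two of the ingredients named in this abstract, modularity and conductance, have standard definitions that can be computed directly. The sketch below assumes an undirected, unweighted graph without self-loops, given as an edge list. It does not reproduce the paper's F-score, whose combination rule is not stated in the abstract, and the helper `within_size_range` is a hypothetical illustration of the [3, 100] size constraint.

```python
from collections import defaultdict


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def modularity(edges, communities):
    """Newman modularity: sum over communities of L_c/m - (vol_c / 2m)^2."""
    m = len(edges)
    deg = degrees(edges)
    label = {node: i for i, c in enumerate(communities) for node in c}
    internal = [0] * len(communities)
    for u, v in edges:
        if label[u] == label[v]:
            internal[label[u]] += 1
    q = 0.0
    for i, c in enumerate(communities):
        vol = sum(deg[n] for n in c)
        q += internal[i] / m - (vol / (2 * m)) ** 2
    return q


def conductance(edges, community):
    """Cut size divided by the smaller of the two side volumes."""
    deg = degrees(edges)
    inside = set(community)
    cut = sum(1 for u, v in edges if (u in inside) != (v in inside))
    vol_in = sum(deg[n] for n in inside)
    vol_out = 2 * len(edges) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


def within_size_range(communities, lo=3, hi=100):
    """Keep only communities whose size satisfies the challenge constraint."""
    return [c for c in communities if lo <= len(c) <= hi]


# Two triangles joined by a single bridge edge (toy data).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = [{0, 1, 2}, {3, 4, 5}]
print(modularity(edges, partition))      # 5/14 for this toy graph
print(conductance(edges, partition[0]))  # 1/7 for this toy graph
```

Higher modularity and lower conductance both indicate a better-separated community, which is why a combined score has to reconcile metrics that point in opposite directions.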

    A new efficient and unbiased approach for clustering quality evaluation

    Traditional quality indexes (Inertia, DB, ...) are known to be method-dependent indexes that do not allow proper estimation of the quality of the clustering in several cases, such as that of complex data, like textual data. We thus propose an alternative approach for clustering quality evaluation based on unsupervised measures of Recall, Precision and F-measure exploiting the descriptors of the data associated with the obtained clusters. Two categories of index are proposed, namely Macro and Micro indexes. This paper also focuses on the construction of a new cumulative Micro precision index that makes it possible to evaluate the overall quality of a clustering result while clearly distinguishing between homogeneous and heterogeneous, or degenerate, results. The experimental comparison of the behavior of the classical indexes with our new approach is performed on a polythematic dataset of bibliographical references issued from the PASCAL database.

    Characteristic MicroRNAs Linked to Dysregulated Metabolic Pathways in Qatari Adult Subjects With Obesity and Metabolic Syndrome

    Background: Obesity-associated dysglycemia is associated with metabolic disorders. MicroRNAs (miRNAs) are known regulators of metabolic homeostasis. We aimed to assess the relationship of circulating miRNAs with clinical features in obese Qatari individuals. Methods: We analyzed a dataset of 39 age-matched patients that includes 18 subjects with obesity only (OBO) and 21 subjects with obesity and metabolic syndrome (OBM). We measured 754 well-characterized human microRNAs (miRNAs) and identified differentially expressed miRNAs along with their significant associations with clinical markers in these patients. Results: A total of 64 miRNAs were differentially expressed between metabolically healthy obese (OBO) versus metabolically unhealthy obese (OBM) patients. Thirteen out of 64 miRNAs significantly correlated with at least one clinical trait of the metabolic syndrome. Six out of the thirteen demonstrated significant association with HbA1c levels; miR-331-3p, miR-452-3p, and miR-485-5p were over-expressed, whereas miR-153-3p, miR-182-5p, and miR-433-3p were under-expressed in the OBM patients with elevated HbA1c levels. We also identified miR-106b-3p, miR-652-3p, and miR-93-5p that showed a significant association with creatinine; miR-130b-5p, miR-363-3p, and miR-636 were significantly associated with cholesterol, whereas miR-130a-3p was significantly associated with LDL. Additionally, miR-652-3p’s differential expression correlated significantly with HDL and creatinine. Conclusions: MicroRNAs associated with metabolic syndrome in obese subjects may have a pathophysiologic role and can serve as markers for obese individuals predisposed to various metabolic diseases like diabetes.

    Kernel Spectral Clustering and applications

    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and $L_0$, $L_1$, $L_0 + L_1$, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms"
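The ECOC step mentioned in this abstract, which combines binary decision functions into a multi-way assignment, can be sketched as follows. The abstract does not say how the codebook is built or decoded. The sketch assumes that codewords are the k most frequent sign patterns among the training score variables and that an unseen point goes to the codeword at minimum Hamming distance. All function names and score values are hypothetical.

```python
from collections import Counter


def sign_code(scores):
    """Binarize a vector of score variables into a tuple of +1 / -1."""
    return tuple(1 if s >= 0 else -1 for s in scores)


def build_codebook(train_scores, k):
    """Assumed rule: the k most frequent training sign patterns become codewords."""
    counts = Counter(sign_code(s) for s in train_scores)
    return [code for code, _ in counts.most_common(k)]


def hamming(a, b):
    """Number of positions at which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(scores, codebook):
    """Index of the codeword nearest (in Hamming distance) to the point's code."""
    code = sign_code(scores)
    return min(range(len(codebook)), key=lambda i: hamming(code, codebook[i]))


# Made-up outputs of k - 1 = 2 binary decision functions on six training points.
train_scores = [
    (0.9, 0.2), (0.8, 0.3), (1.1, 0.1),   # sign pattern (+1, +1), three times
    (-0.5, 0.7), (-0.6, 0.9),             # sign pattern (-1, +1), twice
    (-0.4, -0.8),                         # sign pattern (-1, -1), once
]
codebook = build_codebook(train_scores, k=3)

# Out-of-sample points are assigned without retraining.
print(assign((0.4, 0.6), codebook))
print(assign((-0.2, 0.5), codebook))
print(assign((-0.3, -0.1), codebook))
```

A point whose sign pattern matches no codeword exactly, such as (+1, -1) here, still receives a cluster, since decoding picks the nearest codeword rather than requiring an exact match.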

    Molecular mechanism of RIPK1 and caspase-8 in homeostatic type I interferon production and regulation

    Type I interferons (IFNs) are essential innate immune proteins that maintain tissue homeostasis through tonic expression and can be upregulated to drive antiviral resistance and inflammation upon stimulation. However, the mechanisms that inhibit aberrant IFN upregulation in homeostasis and the impacts of tonic IFN production on health and disease remain enigmatic. Here, we report that caspase-8 negatively regulates type I IFN production by inhibiting the RIPK1-TBK1 axis during homeostasis across multiple cell types and tissues. When caspase-8 is deleted or inhibited, RIPK1 interacts with TBK1 to drive elevated IFN production, leading to heightened resistance to norovirus infection in macrophages but also early onset lymphadenopathy in mice. Combined deletion of caspase-8 and RIPK1 reduces the type I IFN signaling and lymphadenopathy, highlighting the critical role of RIPK1 in this process. Overall, our study identifies a mechanism to constrain tonic type I IFN during homeostasis which could be targeted for infectious and inflammatory diseases.

    An integrated multi-omic approach demonstrates distinct molecular signatures between human obesity with and without metabolic complications: a case–control study

    Objectives: To examine the hypothesis that obesity complicated by the metabolic syndrome, compared to uncomplicated obesity, has distinct molecular signatures and metabolic pathways. Methods: We analyzed a cohort of 39 participants with obesity that included 21 with metabolic syndrome, age-matched to 18 without metabolic complications. We measured in whole blood samples 754 human microRNAs (miRNAs), 704 metabolites using unbiased mass spectrometry metabolomics, and 25,682 transcripts, which include both protein coding genes (PCGs) as well as non-coding transcripts. We then identified differentially expressed miRNAs, PCGs, and metabolites and integrated them using databases such as mirDIP (mapping between miRNA-PCG network), Human Metabolome Database (mapping between metabolite-PCG network) and tools like MetaboAnalyst (mapping between metabolite-metabolic pathway network) to determine dysregulated metabolic pathways in obesity with metabolic complications. Results: We identified 8 significantly enriched metabolic pathways comprising 8 metabolites, 25 protein coding genes and 9 microRNAs which are each differentially expressed between the subjects with obesity and those with obesity and metabolic syndrome. By performing unsupervised hierarchical clustering on the enrichment matrix of the 8 metabolic pathways, we could approximately segregate the uncomplicated obesity stratum from that of obesity with metabolic syndrome. Conclusions: The data suggest that at least 8 metabolic pathways, along with their various dysregulated elements, identified via our integrative bioinformatics pipeline, can potentially differentiate those with obesity from those with obesity and metabolic complications.

    Sparsity in Large Scale Kernel Models

    In the modern era, with the advent of technology and its widespread usage, there is a huge proliferation of data. Gigabytes of data from mobile devices, market baskets, geo-spatial images, search engines, online social networks etc. can be easily obtained, accumulated and stored. This immense wealth of data has resulted in massive datasets and has led to the emergence of the concept of Big Data. Mining useful information from this big data is a challenging task. With the availability of more data, the choice of predictive models narrows, because very few tools are feasible for processing large scale datasets. A successful learning framework to perform various learning tasks like classification, regression, clustering, dimensionality reduction, feature selection etc. is offered by Least Squares Support Vector Machines (LSSVM), which is designed in a primal-dual optimization setting. It provides the flexibility to extend core models by adding additional constraints to the primal problem, by changing the objective function or introducing new model selection criteria. The goal of this thesis is to explore the role of sparsity in large scale kernel models using core models adopted from the LSSVM framework. Real-world data is often noisy and only a small fraction of it contains the most relevant information. Sparsity plays a big role in the selection of this representative subset of data. We first explored sparsity in the case of large scale LSSVM using fixed-size methods with a re-weighted L1 penalty on top, resulting in very sparse LSSVM (VS-LSSVM). An important aspect of kernel based methods is the selection of a subset on which the model is built and validated. We proposed a novel fast and unique representative subset (FURS) selection technique to select a subset from complex networks which retains the inherent community structure in the network.
    We extend this method for Big Data learning by constructing k-NN graphs out of dense data using a distributed computing platform, i.e. Hadoop, and then apply the FURS selection technique to obtain representative subsets on top of which models are built by kernel based methods. We then focused on scaling the kernel spectral clustering (KSC) technique for big data networks. We devised two model selection techniques, namely balanced angular fitting (BAF) and self-tuned KSC (ST-KSC), by exploiting the structure of the projections in the eigenspace to obtain the optimal number of communities k in the large graph. A multilevel hierarchical kernel spectral clustering (MH-KSC) technique was then proposed which performs agglomerative hierarchical clustering using similarity information between the out-of-sample eigen-projections. Furthermore, we developed an algorithm to identify intervals for hierarchical clustering using the Gershgorin Circle theorem. These intervals were used to identify the optimal number of clusters at a given level of hierarchy in combination with the KSC model. The MH-KSC technique was extended from networks to images and datasets using the BAF model selection criterion. We also proposed optimal sparse reductions to the KSC model by reconstructing the model using a reduced set. We exploited the Group Lasso and convex re-weighted L1 penalty to sparsify the KSC model. Finally, we explored the role of the re-weighted L1 penalty in the case of feature selection in combination with LSSVM. We proposed a visualization (Netgram) toolkit to track the evolution of communities/clusters over time in the case of dynamic time-evolving communities and datasets. Real world applications considered in this thesis include classification and regression of large scale datasets, image segmentation, flat and hierarchical community detection in large scale graphs and visualization of evolving communities. nrpages: 238; status: published
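The Gershgorin Circle theorem used in the thesis states that every eigenvalue of a square matrix lies in at least one disc centred at a diagonal entry, with radius equal to the sum of the absolute off-diagonal entries in that row. For a real symmetric matrix the discs reduce to intervals on the real line. The sketch below only computes and merges those intervals. How the thesis converts them into candidate cluster counts is not described in the abstract, and the matrix is made-up toy data.

```python
def gershgorin_intervals(matrix):
    """One (low, high) interval per row of a real symmetric matrix."""
    intervals = []
    for i, row in enumerate(matrix):
        radius = sum(abs(x) for j, x in enumerate(row) if j != i)
        intervals.append((row[i] - radius, row[i] + radius))
    return intervals


def merge_intervals(intervals):
    """Union overlapping intervals; all eigenvalues lie inside the result."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Toy symmetric matrix: two strongly coupled rows and one nearly isolated row.
A = [
    [4.0, 1.0, 0.0],
    [1.0, 4.0, 0.5],
    [0.0, 0.5, 10.0],
]
print(gershgorin_intervals(A))
print(merge_intervals(gershgorin_intervals(A)))
```

A standard refinement of the theorem says that a connected group of discs disjoint from the rest contains as many eigenvalues as it has discs, so well-separated groups of intervals localise groups of eigenvalues without computing an eigen-decomposition.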